For this report, I used the epuRate template by Yan Holtz, who made it publicly available.
In this project, I will use basic techniques of Statistical Process Control (SPC), which are commonly used in industry to monitor process stability. I will manually build Shewhart control charts to monitor the stability of a critical quality parameter of the Swiss cheese “Gruyère”: the percentage of milk fat in dry matter (MF/DM).
Image taken from the EPFL Extension School website
Before diving into the analysis, let’s try to understand the structure of our data. The quality attribute of interest is the percentage of milk fat in dry matter (MF/DM). The measurements are taken 5 times a day, each day of the week (Monday to Friday), for 25 weeks. Below are the definitions of the columns:

- factory: the factory where the measurements were taken
- week: the week number (1 to 25)
- timepoint: the day number, counted continuously across the weeks
- m1 to m5: the five MF/DM measurements taken on that day

The data contains measurements for two factories. We will therefore separate them into two separate tables.
# 1. import the data
cheese_data <- read_delim("data/cheese_data.csv", delim = ",")
# rename the factory names for better reproducibility in the graphs
cheese_data <- cheese_data %>%
mutate(factory = recode(factory, f1 = "Factory 1", f2 = "Factory 2"))
# 2 and 3. Separate the factories into two data sets
cheese_data_f1 <- cheese_data %>%
filter(factory == "Factory 1")
cheese_data_f2 <- cheese_data %>%
filter(factory == "Factory 2")
# Create table for f1
cheese_data_f1 %>%
select(-factory) %>%
head(n = 10) %>%
gt() %>%
tab_header(md("<span style='color:#9C5700'>**Factory 1**</span>"),
subtitle = "measurement records (first 10)") %>%
tab_options(
row.striping.background_color = "#FDF2C4",
row.striping.include_stub = NULL,
row.striping.include_table_body = TRUE
)

| Factory 1 | ||||||
|---|---|---|---|---|---|---|
| measurement records (first 10) | ||||||
| week | timepoint | m1 | m2 | m3 | m4 | m5 |
| 1 | 1 | 50.0 | 51.0 | 51.9 | 50.5 | 51.5 |
| 1 | 2 | 52.2 | 50.8 | 50.4 | 50.8 | 50.9 |
| 1 | 3 | 50.8 | 50.7 | 51.0 | 50.5 | 51.7 |
| 1 | 4 | 51.4 | 51.1 | 51.8 | 51.8 | 50.8 |
| 1 | 5 | 49.7 | 50.9 | 50.3 | 50.7 | 51.8 |
| 2 | 6 | 52.2 | 51.0 | 49.5 | 50.8 | 51.2 |
| 2 | 7 | 50.9 | 52.6 | 49.7 | 0.0 | 50.1 |
| 2 | 8 | 51.2 | 50.6 | 50.4 | 51.3 | 49.7 |
| 2 | 9 | 52.6 | 50.5 | 51.3 | 51.8 | 51.3 |
| 2 | 10 | 49.7 | 50.3 | 52.0 | 51.8 | 50.4 |
# Create table for f2
cheese_data_f2 %>%
select(-factory) %>%
head(n = 10) %>%
gt() %>%
tab_header(md("<span style='color:#FF6C7A'>**Factory 2**</span>"),
subtitle = "measurement records (first 10)") %>%
tab_options(
row.striping.background_color = "#FFC7CD",
row.striping.include_stub = NULL,
row.striping.include_table_body = TRUE
)

| Factory 2 | ||||||
|---|---|---|---|---|---|---|
| measurement records (first 10) | ||||||
| week | timepoint | m1 | m2 | m3 | m4 | m5 |
| 1 | 1 | 50.5 | 50.6 | 50.5 | 50.4 | 50.4 |
| 1 | 2 | 50.5 | 50.5 | 50.4 | 50.6 | 50.4 |
| 1 | 3 | 50.5 | 50.4 | 50.6 | 50.4 | 50.1 |
| 1 | 4 | 50.5 | 50.3 | 50.8 | 50.5 | 50.4 |
| 1 | 5 | 50.3 | 50.4 | 50.6 | 50.3 | 50.5 |
| 2 | 6 | 51.1 | 51.0 | 50.6 | 50.9 | 50.8 |
| 2 | 7 | 50.8 | 50.7 | 50.7 | 50.8 | 50.7 |
| 2 | 8 | 51.0 | 50.6 | 51.1 | 50.8 | 50.9 |
| 2 | 9 | 50.9 | 50.9 | 51.0 | 50.9 | 50.5 |
| 2 | 10 | 50.9 | 50.7 | 50.9 | 50.9 | 50.8 |
If you look closely at Factory 1, you can spot a measurement with the value 0. Does it mean that at this time the cheese was milk-fat free? Probably not. Unfortunately, milk fat measurements are sometimes not performed; in that case, a zero is entered in the system by default. According to the Quality Manager, we should replace these values with the average of the other measurements obtained at the same timepoint. However, if there are 2 or more missing values for a timepoint, the measurements are considered not representative and all values obtained at this timepoint have to be discarded from the analysis. Let’s do it!
# change the data to a long format
cheese_data_long <- cheese_data %>%
pivot_longer(cols = starts_with("m"),
names_to = "measurement_number",
values_to = "milk_fat")
# count the number of values to be replaced and values to be discarded
missing_table <- cheese_data_long %>%
group_by(factory, week, timepoint) %>%
filter(milk_fat == 0) %>%
count(name = "missing_values_per_timepoint") %>%
ungroup() %>%
count(missing_values_per_timepoint, name = "number_times_missing")
missing_values <- missing_table %>%
filter(missing_values_per_timepoint == 1) %>%
pull(number_times_missing)
discarded_values <- missing_table %>%
filter(missing_values_per_timepoint == 2) %>%
pull(number_times_missing)

1. How many missing values have to be replaced?
3 missing values have to be replaced by the average of other measurements obtained at the same timepoint.
2. How many timepoints have to be discarded?
2 timepoints have to be discarded.
Okay, let’s handle the missing values and do a quick check.
# 3. Spot the rows where there are at least two missing values
to_discard <- cheese_data_long %>%
group_by(factory, week, timepoint) %>%
filter(milk_fat == 0) %>%
count(name = "missing_values_per_timepoint") %>%
filter(missing_values_per_timepoint >= 2) %>%
select(-missing_values_per_timepoint)
# discard the timepoints
cheese_data_long <- cheese_data_long %>%
anti_join(to_discard, by = c("factory", "week", "timepoint"))
# replace the 0 values
cheese_data_long <- cheese_data_long %>%
group_by(factory, week, timepoint) %>%
mutate(milk_fat = if_else(condition = milk_fat == 0,
true = sum(milk_fat)/(n() - 1),
false = milk_fat)) %>%
ungroup()

# check that no zero values remain
cheese_data_long %>%
filter(milk_fat == 0)

It seems that I don’t have any “0” values anymore. Great!
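As a quick sanity check of the replacement rule, here is a toy computation using the week 2, timepoint 7 values of Factory 1 shown earlier (a sketch, not part of the pipeline):

```r
# The zero placeholder adds nothing to the sum, so dividing by n - 1
# yields the mean of the four genuine measurements.
x <- c(50.9, 52.6, 49.7, 0.0, 50.1)   # Factory 1, week 2, timepoint 7
replaced <- sum(x) / (length(x) - 1)
replaced
```

Replacing the zero with this value leaves the timepoint mean unchanged, which is why the same figure (50.825) appears later as the xbar of that timepoint in the statistics summary.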
Besides missing measurements (recorded as 0), suspicious measurements may also occur. This is because the measurements are taken by hand, so errors may creep in. According, once again, to our Quality Manager, all measurements below 15 or above 65 are considered suspicious. Let’s inspect the data.
# finding unexpected values
unexpected_values <- cheese_data_long %>%
filter(!between(milk_fat, 15, 65))
knitr::kable(unexpected_values)

| factory | week | timepoint | measurement_number | milk_fat |
|---|---|---|---|---|
| Factory 1 | 4 | 17 | m2 | 505 |
There is 1 suspicious value in our data. In my opinion, it is probably a decimal omission: instead of 505, the controller probably meant 50.5. I will therefore correct it accordingly.
# correct the decimal omission: divide implausibly large values (> 150) by 10
cheese_data_long <- cheese_data_long %>%
mutate(milk_fat = ifelse(milk_fat > 150, milk_fat / 10, milk_fat))

Now that we have cleaned up our data, there is still one more thing we need to do before we can start with the control phases: calculating the arithmetic mean and the range for each timepoint.
# select factory 1
cheese_data_long_f1 <- cheese_data_long %>%
filter(factory == "Factory 1")
# 1. store the values in a tibble named spc_data_f1
spc_data_f1 <- cheese_data_long_f1 %>%
group_by(factory, week, timepoint) %>%
mutate(xbar = mean(milk_fat),
Range = max(milk_fat) - min(milk_fat)) %>%
ungroup() %>%
select(-measurement_number, -milk_fat) %>%
unique()
# Create a table with the specifications data of f1
spc_data_f1 %>%
select(-factory) %>%
head(n = 10) %>%
gt() %>%
tab_header(md("<span style='color:#9C5700'>**Factory 1**</span>"),
subtitle = "statistics summary (first 10)") %>%
tab_options(
row.striping.background_color = "#FDF2C4",
row.striping.include_stub = NULL,
row.striping.include_table_body = TRUE
)

| Factory 1 | |||
|---|---|---|---|
| statistics summary (first 10) | |||
| week | timepoint | xbar | Range |
| 1 | 1 | 50.980 | 1.9 |
| 1 | 2 | 51.020 | 1.8 |
| 1 | 3 | 50.940 | 1.2 |
| 1 | 4 | 51.380 | 1.0 |
| 1 | 5 | 50.680 | 2.1 |
| 2 | 6 | 50.940 | 2.7 |
| 2 | 7 | 50.825 | 2.9 |
| 2 | 8 | 50.640 | 1.6 |
| 2 | 9 | 51.500 | 2.1 |
| 2 | 10 | 50.840 | 2.3 |
How many rows does the statistics summary table of Factory 1 have?
The table contains one row per timepoint, which corresponds to 123 rows in total.
Let’s now compute the mean statistics.
# 3. compute the average of xbar and Range values of f1
mean_stats_f1 <- spc_data_f1 %>%
summarise(mean_xbar = mean(xbar),
mean_range = mean(Range))
# Create a mean_stats table for f1
mean_stats_f1 %>%
gt() %>%
tab_header(md("<span style='color:#9C5700'>**Factory 1**</span>"),
subtitle = "mean statistics") %>%
tab_options(
row.striping.background_color = "#FDF2C4",
row.striping.include_stub = NULL,
row.striping.include_table_body = TRUE
)

| Factory 1 | |
|---|---|
| mean statistics | |
| mean_xbar | mean_range |
| 51.00715 | 1.736585 |
Now we have everything we need to start with the controls. There are 2 phases when working with control charts: Phase I, where the center line and control limits are established, and Phase II, where the process is monitored.
During Phase I, the center line and control limits are established. The key thing during Phase I of a control chart is to make sure that the corresponding values were captured while the production process was stable. Let’s try to make control charts for the range and mean values.
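Before running the code, it may help to see the Xbar-R formulas spelled out. A minimal sketch, assuming the standard textbook constants for subgroups of size 5 (A2 = 0.577, D3 = 0, D4 = 2.114) and plugging in the Factory 1 summary statistics from above:

```r
# Illustrative computation of Shewhart Xbar-R limits for n = 5,
# using the Factory 1 mean statistics shown earlier.
x_bar_bar <- 51.00715   # grand mean of the daily means
r_bar     <- 1.736585   # mean daily range
A2 <- 0.577; D3 <- 0; D4 <- 2.114   # standard constants for n = 5

ucl_r <- D4 * r_bar                  # R chart upper limit
lcl_r <- D3 * r_bar                  # R chart lower limit (0 when n <= 6)
ucl_x <- x_bar_bar + A2 * r_bar      # Xbar chart upper limit
lcl_x <- x_bar_bar - A2 * r_bar      # Xbar chart lower limit
round(c(lcl_x, ucl_x, lcl_r, ucl_r), 3)
```

The XbarR_constants.csv file imported below supplies these same constants, so the chunk that follows should produce matching limits.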
# Import xbarR_constants file
XbarR_constants <- read_delim("data/XbarR_constants.csv", delim = ",") %>%
clean_names()
# filter the subgroup size
constants_sub_5 <- XbarR_constants %>%
filter(n == 5)
# computing the LCL and UCL for Range in f1
lcl_range_f1 <- constants_sub_5$d3_2 * mean_stats_f1$mean_range
ucl_range_f1 <- constants_sub_5$d4 * mean_stats_f1$mean_range
# Produce the Range chart function
range_chart <- function(spc_data, mean_stats, lcl_range, ucl_range) {
spc_data %>%
ggplot(mapping = aes(x = timepoint, y = Range)) +
geom_line() +
geom_point() +
geom_hline(aes(yintercept = ucl_range),color = "red") +
geom_hline(aes(yintercept = lcl_range), color = "red") +
geom_hline(aes(yintercept = mean_stats$mean_range), linetype = "dashed") +
geom_text(mapping = aes(x = 110,
y = lcl_range - 0.2,
label = "Lower Control Limit"),
color = "red") +
geom_text(mapping = aes(x = 110,
y = ucl_range + 0.2,
label = "Upper Control Limit"),
color = "red") +
theme_minimal() +
labs(title = stringr::str_glue("Shewhart control charts: **{spc_data$factory}**"),
subtitle = "Measurements Range per Day (Max value - Min value)",
y = "Range (MF/DM)",
x = "Day",
caption = "Source: EPFL extension School") +
theme(plot.title = ggtext::element_markdown())
}
# plot the range chart for factory 1
range_chart(spc_data_f1, mean_stats_f1, lcl_range_f1, ucl_range_f1)

According to our graph, the measurement ranges have remained within the control limits for the last 125 days. Values vary around an average in a random and symmetric way, without showing any atypical pattern. In other words, there is only white noise, which is very good news! We can conclude that our process is stable with regard to ranges.
Let’s see what happens with the mean values.
# computing the LCL and UCL for xbar in f1
lcl_xbar_f1 <- mean_stats_f1$mean_xbar - constants_sub_5$a2 * mean_stats_f1$mean_range
ucl_xbar_f1 <- mean_stats_f1$mean_xbar + constants_sub_5$a2 * mean_stats_f1$mean_range
# Produce the xbar chart function
xbar_chart <- function(spc_data, mean_stats, lcl_xbar, ucl_xbar){
spc_data %>%
ggplot(mapping = aes(x = timepoint, y = xbar)) +
geom_line() +
geom_point() +
geom_hline(aes(yintercept = ucl_xbar),color = "red") +
geom_hline(aes(yintercept = lcl_xbar), color = "red") +
geom_hline(aes(yintercept = mean_stats$mean_xbar), linetype = "dashed") +
geom_text(mapping = aes(x = 110,
y = lcl_xbar - 0.2,
label = "Lower Control Limit"),
color = "red") +
geom_text(mapping = aes(x = 110,
y = ucl_xbar + 0.2,
label = "Upper Control Limit"),
color = "red") +
theme_minimal() +
labs(title = stringr::str_glue("Shewhart control charts: **{spc_data$factory}**"),
subtitle = "Measurements Average per Day",
y = "Arithmetic mean (MF/DM)",
x = "Day",
caption = "Source: EPFL extension School") +
theme(plot.title = ggtext::element_markdown())
}
# plot the xbar chart for factory 1
xbar_chart(spc_data_f1, mean_stats_f1, lcl_xbar_f1, ucl_xbar_f1)

Here, however, we can see that the average measurements per day exceeded the upper control limit on two occasions, on days 19 and 60. Moreover, there seem to be several measurements in a row on one side of the centerline. The process does not look stable. To find out for sure, let’s move on to Phase II.
During Phase II, the monitoring phase, the aim is to detect special causes of variation, which indicate that the process is unstable. There are 2 rules that give evidence of a special cause of variation on a control chart:

- a point falls outside the control limits;
- 10 consecutive points fall on the same side of the centerline.

Let’s highlight those special causes of variation.
# Create a function that checks whether all values are 0 or all values are 1
all_mean_ref_value <- function(x) all(x == 0) | all(x == 1)
# adding highlight variable based on the 2 rules
spc_data_f1 <- spc_data_f1 %>%
mutate(above_mean = if_else(xbar > mean(xbar), 1, 0),
outliers = as.factor(case_when(xbar > ucl_xbar_f1 ~ 1,
xbar < lcl_xbar_f1 ~ 1,
slide_lgl(.x = above_mean, # check whether the last 10 values are all on the same side
.f = all_mean_ref_value,
.before = 9,
.complete = TRUE) ~ 2,
TRUE ~ 0)))
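To make the windowing logic of slide_lgl() clearer, here is a hypothetical base-R equivalent of the run-detection step (illustrative only, not the code used in the analysis):

```r
# Flags positions where the window of 10 values (the current value plus
# the 9 before it) all sit on the same side of the mean.
run_of_ten <- function(x) {
  vapply(seq_along(x), function(i) {
    if (i < 10) return(NA)              # incomplete window, like .complete = TRUE
    w <- x[(i - 9):i]
    all(w == 0) || all(w == 1)
  }, logical(1))
}

above <- c(rep(0, 4), rep(1, 10))       # toy 0/1 "above the mean" indicator
run_of_ten(above)
```

Only position 14 is flagged here, since that is the first point whose trailing window of 10 values is entirely on one side of the mean.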
# Create xbar chart function with special cause highlights
xbar_chart_high <- function(spc_data, mean_stats, lcl_xbar, ucl_xbar) {
spc_data %>%
ggplot(mapping = aes(x = timepoint, y = xbar)) +
geom_line() +
geom_point(aes(color = outliers)) +
geom_hline(aes(yintercept = ucl_xbar),color = "red") +
geom_hline(aes(yintercept = lcl_xbar), color = "red") +
geom_hline(aes(yintercept = mean_stats$mean_xbar), linetype = "dashed") +
geom_text(mapping = aes(x = 110,
y = lcl_xbar - 0.2,
label = "Lower Control Limit"),
color = "red") +
geom_text(mapping = aes(x = 110,
y = ucl_xbar + 0.2,
label = "Upper Control Limit"),
color = "red") +
scale_color_manual(values = c("black", "red", "green")) +
theme_minimal() +
labs(title = stringr::str_glue("Shewhart control charts: **{spc_data$factory}**"),
subtitle = "Measurements Average per Day",
y = "Arithmetic mean (MF/DM)",
x = "Day",
caption = "Source: EPFL extension School") +
theme(plot.title = ggtext::element_markdown(),
legend.position = "none")
}
xbar_chart_high(spc_data_f1, mean_stats_f1, lcl_xbar_f1, ucl_xbar_f1)

Now let’s do the same for factory 2.
# Select data for factory 2 only
cheese_data_long_f2 <- cheese_data_long %>%
filter(factory == "Factory 2")
# calculate the spc_data for f2
spc_data_f2 <- cheese_data_long_f2 %>%
group_by(factory, week, timepoint) %>%
mutate(xbar = mean(milk_fat),
Range = max(milk_fat) - min(milk_fat)) %>%
ungroup() %>%
select(-measurement_number, -milk_fat) %>%
unique()
# compute the average of xbar and Range values for f2
mean_stats_f2 <- spc_data_f2 %>%
summarise(mean_xbar = mean(xbar),
mean_range = mean(Range))
# compute the UCL and LCL range for f2
lcl_range_f2 <- constants_sub_5$d3_2 * mean_stats_f2$mean_range
ucl_range_f2 <- constants_sub_5$d4 * mean_stats_f2$mean_range
# Produce a range chart for f2
range_chart(spc_data_f2, mean_stats_f2, lcl_range_f2, ucl_range_f2)

# compute the UCL and LCL xbar for f2
lcl_xbar_f2 <- mean_stats_f2$mean_xbar - constants_sub_5$a2 * mean_stats_f2$mean_range
ucl_xbar_f2 <- mean_stats_f2$mean_xbar + constants_sub_5$a2 * mean_stats_f2$mean_range
# adding the highlight variable
spc_data_f2 <- spc_data_f2 %>%
mutate(above_mean = if_else(xbar > mean(xbar), 1, 0),
outliers = as.factor(case_when(xbar > ucl_xbar_f2 ~ 1,
xbar < lcl_xbar_f2 ~ 1,
slide_lgl(.x = above_mean,
.f = all_mean_ref_value,
.before = 9,
.complete = TRUE) ~ 1,
TRUE ~ 0)))
# Produce a xbar chart for f2
xbar_chart_high(spc_data_f2, mean_stats_f2, lcl_xbar_f2, ucl_xbar_f2)

Here we can see that the distribution tends to be bimodal, with a high mean one week and a low mean the next. The values are so extreme that we wonder whether we chose the right LCL and UCL. In fact, it depends on how the limits are calculated: they are based on the mean within-day range, and since that range is small, the limits are narrow. They do not take into account the fact that the mean can differ greatly from one week to another.
Let’s try to produce the same chart but per week instead of per day.
# Create xbar chart with values per week function
xbar_week_chart <- function(spc_data, mean_stats, lcl_xbar, ucl_xbar) {
spc_data %>%
ggplot(mapping = aes(x = week, y = xbar)) +
geom_point() +
geom_hline(aes(yintercept = ucl_xbar),color = "red") +
geom_hline(aes(yintercept = lcl_xbar), color = "red") +
geom_hline(aes(yintercept = mean_stats$mean_xbar), linetype = "dashed") +
geom_text(mapping = aes(x = 25,
y = lcl_xbar - 0.2,
label = "Lower Control Limit"),
color = "red") +
geom_text(mapping = aes(x = 25,
y = ucl_xbar + 0.2,
label = "Upper Control Limit"),
color = "red") +
theme_minimal() +
labs(title = stringr::str_glue("Shewhart control charts: **{spc_data$factory}**"),
subtitle = "Measurements Average per Week",
y = "Arithmetic mean (MF/DM)",
x = "Week",
caption = "Source: EPFL extension School") +
theme(plot.title = ggtext::element_markdown())
}
# xbar chart per week for factory 1
xbar_week_chart(spc_data_f1, mean_stats_f1, lcl_xbar_f1, ucl_xbar_f1)
# xbar chart per week for factory 2
xbar_week_chart(spc_data_f2, mean_stats_f2, lcl_xbar_f2, ucl_xbar_f2)

For Factory 2, we see clearly that the range within a week is really small, but the values differ greatly between weeks. Let’s try to calculate control limits considering the data weekly.
# calculate new weekly spc_data for f2
spc_data_week_f2 <- cheese_data_long_f2 %>%
group_by(factory, week) %>%
summarise(xbar = mean(milk_fat),
Range = max(milk_fat) - min(milk_fat))
# create the moving range column (absolute difference between consecutive weekly means)
spc_data_week_f2 <- spc_data_week_f2 %>%
mutate(prev_xbar = lag(xbar),
MR = abs(xbar - prev_xbar))
# calculate the new mean stats
mean_stats_week_f2 <- spc_data_week_f2 %>%
summarise(mean_xbar = mean(xbar),
MR_bar = mean(MR, na.rm = TRUE))
# calculate the new LCL and UCL for factory 2
lcl_week_f2 <- mean_stats_week_f2$mean_xbar - 2.66 * mean_stats_week_f2$MR_bar
ucl_week_f2 <- mean_stats_week_f2$mean_xbar + 2.66 * mean_stats_week_f2$MR_bar
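A side note on the constant 2.66 used here: for an individuals (I-MR) chart it comes from 3/d2, where d2 ≈ 1.128 is the constant for moving ranges of two consecutive points (3/1.128 ≈ 2.66). A minimal sketch with hypothetical weekly means:

```r
# Toy individuals-chart limits: center line plus or minus 2.66 times
# the mean moving range.
x  <- c(50.2, 50.8, 49.9, 51.1, 50.5)   # hypothetical weekly means
mr <- abs(diff(x))                       # moving ranges between consecutive weeks
limits <- mean(x) + c(-1, 1) * 2.66 * mean(mr)
limits
```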
# produce the xbar chart per week with the new limits
xbar_week_chart(spc_data_week_f2, mean_stats_week_f2, lcl_week_f2, ucl_week_f2)

Okay, let’s go for a bit of freestyling.
Here are my milestones to reach:
For this project, I would like to see how much time I spend on my phone. I estimate that I should spend about 2 hours a day. Let’s create a control chart to see whether I’m meeting the two-hours-per-day target. I have been able to collect data for the last 27 days. Let’s import the data.
library(readxl)
library(lubridate)
time_spent <- read_excel("data/time_spent_iphone.xlsx")
time_spent <- time_spent %>%
mutate(frac_min_hour = round(Minutes / 60, digits = 1)) %>%
mutate(duration = round(Hours + frac_min_hour, digits = 1))
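As a quick check of the conversion logic above, take a 7 h 29 min entry (one appears in the table below):

```r
# Minutes become a one-decimal fraction of an hour, then are added to the hours.
hours <- 7; minutes <- 29
duration <- round(hours + round(minutes / 60, 1), 1)
duration
```

Note that rounding the minutes first is what produces 7.5 here: 29/60 ≈ 0.483 rounds up to 0.5 before the addition.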
# Create the table of my data
time_spent %>%
select(-frac_min_hour) %>%
head(n = 10) %>%
gt() %>%
tab_header("Time Spent on my iPhone",
subtitle = "measurement records from 05 April 2021 to 01 May 2021 (first 10)") %>%
opt_row_striping()

| Time Spent on my iPhone | ||||
|---|---|---|---|---|
| measurement records from 05 April 2021 to 01 May 2021 (first 10) | ||||
| Date | Hours | Minutes | weather | duration |
| 2021-05-01 | 1 | 35 | rainy | 1.6 |
| 2021-04-30 | 3 | 37 | rainy | 3.6 |
| 2021-04-29 | 7 | 29 | rainy | 7.5 |
| 2021-04-28 | 3 | 8 | rainy | 3.1 |
| 2021-04-27 | 2 | 12 | sunny | 2.2 |
| 2021-04-26 | 2 | 3 | rainy | 2.0 |
| 2021-04-25 | 1 | 51 | sunny | 1.8 |
| 2021-04-24 | 4 | 56 | sunny | 4.9 |
| 2021-04-23 | 3 | 8 | sunny | 3.1 |
| 2021-04-22 | 1 | 37 | sunny | 1.6 |
Perfect, I have now imported my data. For my control limits, I will use the 2-sigma limits of the control chart. For this, I first need to calculate the mean and the standard deviation of my data. I can then calculate the LCL and UCL using the following formulas:
UCL = m + 2*SD
LCL = m - 2*SD
# calculate the mean and the standard deviation
iphone_mean_stats <- time_spent %>%
summarise(mean = mean(duration),
sd = sd(duration))
# calculate the control limits
ucl_iphone = iphone_mean_stats$mean + 2 * iphone_mean_stats$sd
lcl_iphone = ifelse(iphone_mean_stats$mean - 2 * iphone_mean_stats$sd < 0,
0,
iphone_mean_stats$mean - 2 * iphone_mean_stats$sd) # this is because I can't have negative hours spent
# add the color if special cause of variation
time_spent <- time_spent %>%
mutate(outliers = as.factor(case_when(duration > ucl_iphone ~ 1,
duration < lcl_iphone ~ 1,
TRUE ~ 0)))
text_outlier <- tibble(label = "this is an outlier", x = "2021-04-22", y = 7.5)
# Creating the plot
time_spent %>%
ggplot(mapping = aes(x = as.factor(Date), y = duration)) +
geom_point(aes(color = outliers)) +
geom_hline(aes(yintercept = ucl_iphone),color = "red") +
geom_hline(aes(yintercept = lcl_iphone), color = "red") +
geom_hline(aes(yintercept = iphone_mean_stats$mean), linetype = "dashed") +
geom_text(mapping = aes(x = 25,
y = lcl_iphone - 0.4,
label = "Lower Control Limit"),
color = "red") +
geom_text(mapping = aes(x = 25,
y = ucl_iphone + 0.4,
label = "Upper Control Limit"),
color = "red") +
geom_text(mapping = aes(x = x, y = y, label = label ), data = text_outlier) +
geom_segment(aes(x = "2021-04-25", y = 7.5, xend = "2021-04-28", yend = 7.5),
arrow = arrow(length = unit(0.5, "cm"))) +
scale_color_manual(values = c("black", "red")) +
theme_minimal() +
labs(title = stringr::str_glue("Shewhart control charts: **Valentin's time spent on iPhone**"),
subtitle = "Average per Day",
y = "Hours",
x = "Day",
caption = "Source: iPhone") +
theme(plot.title = ggtext::element_markdown(),
axis.text.x = element_text(angle = 90),
legend.position = "none")

With this graph, we notice that this process cannot be considered stable (unfortunately for me! Well… I am not a machine after all). Indeed, there is first an outlier on April 29th, when I spent more than 7.5 hours on my phone (I actually spent half the night playing a game on my iPhone instead of working on my project ;)).
The second issue is that these hours are not “white noise”. There seems to be a pattern over the days and even weekends. Let’s try again with the days of the week as timepoints.
# re-ordering the day of the week
time_spent <- time_spent %>%
mutate(weekday = weekdays(Date),
weekday = factor(weekday, levels = c("Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday")))
#Plot the weekday Chart
time_spent %>%
ggplot(mapping = aes(x = weekday, y = duration)) +
geom_point(aes(color = outliers)) +
geom_hline(aes(yintercept = ucl_iphone),color = "red") +
geom_hline(aes(yintercept = lcl_iphone), color = "red") +
geom_hline(aes(yintercept = iphone_mean_stats$mean), linetype = "dashed") +
geom_text(mapping = aes(x = 8,
y = lcl_iphone - 0.4,
label = "Lower Control Limit"),
color = "red") +
geom_text(mapping = aes(x = 8,
y = ucl_iphone + 0.4,
label = "Upper Control Limit"),
color = "red") +
scale_color_manual(values = c("black", "red")) +
scale_x_discrete(expand = c(0, 2)) +
theme_minimal() +
labs(title = stringr::str_glue("Shewhart control charts: **Valentin's time spent on iPhone**"),
subtitle = "Average per Weekday",
y = "Hours",
x = "Day",
caption = "Source: iPhone") +
theme(plot.title = ggtext::element_markdown(),
axis.text.x = element_text(angle = 90),
legend.position = "none")

Here again, it is difficult to draw clear conclusions. What I can say is that Saturday is the most variable day of the week and Sunday is the day I use my phone the least. Finally, the outlier on Thursday is apparently unusual.
One last thing we could look at is whether screen time is linked to the weather. I will color the points by the two weather categories.
time_spent %>%
ggplot(mapping = aes(x = as.factor(Date), y = duration)) +
geom_point(aes(color = weather)) +
geom_hline(aes(yintercept = ucl_iphone),color = "red") +
geom_hline(aes(yintercept = lcl_iphone), color = "red") +
geom_hline(aes(yintercept = iphone_mean_stats$mean), linetype = "dashed") +
geom_text(mapping = aes(x = 25,
y = lcl_iphone - 0.4,
label = "Lower Control Limit"),
color = "red") +
geom_text(mapping = aes(x = 25,
y = ucl_iphone + 0.4,
label = "Upper Control Limit"),
color = "red") +
scale_color_manual(values = c("blue", "yellow")) +
theme_minimal() +
labs(title = stringr::str_glue("Shewhart control charts: **Valentin's time spent on iPhone**"),
subtitle = "Average per Day based on weather",
y = "Hours",
x = "Day",
caption = "Source: iPhone") +
theme(plot.title = ggtext::element_markdown(),
axis.text.x = element_text(angle = 90))

That’s it for this project. Thank you very much!
A work by Valentin Monney